Abstract: We consider sensor scheduling as the optimal observability
problem for partially observable Markov decision
processes (POMDP). This model fits the cases where a Markov
process is observed by a single sensor which needs to be 
dynamically adjusted or by a set of sensors which are selected 
one at a time in a way that maximizes the information acquisition 
from the process. As in conventional POMDP problems, the
control action in this model is based on all past measurements;
here, however, the action does not control the state process,
which is autonomous, but instead influences the measurement of
that process. This POMDP is a controlled version of the hidden
Markov process, and we show that its optimal observability 
problem can be formulated as an average cost Markov decision 
process (MDP) scheduling problem. In this problem, a policy is a 
rule for selecting sensors or adjusting the measuring device based 
on the measurement history. Given a policy, we can evaluate 
the estimation entropy for the joint state-measurement processes 
which inversely measures the observability of state process for 
that policy. Considering estimation entropy as the cost of a
policy, we show that the problem of finding the optimal policy is
equivalent to an average cost MDP scheduling problem where
the cost function is the entropy function over the belief space.
This allows the application of the policy iteration algorithm for
finding the policy that achieves minimum estimation entropy, and thus
optimum observability.

I. Introduction 

Sensor scheduling aims to achieve optimum visibility of 
the sensing process. This problem arises in situations where a 
number of sensors are set to measure a process, and different 
sensors provide different visibility depending on the states of 
the process. However, the sensor management strategy, band-limited
communications, or the network itself allows only one
sensor reading at a time. An example of such a network is a
set of sensors deployed densely in a region to track moving
targets, where only one sensor can communicate at a time. Depending
on the target position, sensors will have different visibility 
of the target, each providing a vague estimate of the target 
position. Another example is the waveform selection for a 
radar system, where various waveforms have different effects 
for target observation ([1],[2]). In these systems or networks 
a sensor selection policy needs to be implemented to ensure 
maximum information flow from the process to the observer. 
On the other hand, if the parameters of a single measuring
(sensing) device can be adjusted where different values of the 
parameters provide better observability at different states of 
a process, then finding the optimal on-line adjustment policy 
can also be considered as a sensor scheduling problem. 



Networks of this kind, in which the process is a Markov
process and the measurements are memoryless, can be modelled
as a POMDP. However, in contrast to the usual application of
POMDPs to control problems, here we deal with a
different kind of problem that we call the observability problem.
In such a problem the Markov process is autonomous and
the action does not influence its evolution. Instead our action
influences the visibility of the Markov process which could 
also depend on its state. As in the controllability problem,
a policy here is a rule for choosing the action based on
the belief about the Markov process. This belief is built up from
the measurement history and is a variable over an infinite state
space. For both problems, finding the optimal policy
can be turned into a Markov decision scheduling
problem whose state is the belief, evolving as a Markov process on this
infinite space. In the controllability problem
the cost function over the belief space is a linear function [3]
(irrespective of the costs associated with the states of the Markov
process), whereas in the observability problem the cost function cannot
be a linear function. In the latter problem the aim is to
control the belief state to move only between almost sure (low 
entropy) regions of belief space. In this paper we formulate 
the observability problem as an average entropy (as cost) MDP 
scheduling problem. 

An average cost MDP scheduling problem finds an optimal 
policy that minimizes the expected time average cost of 
the controlled Markov process [4]. When the state space of the
MDP is finite, the value iteration algorithm and the policy iteration
algorithm (PIA) are simple methods to obtain the optimum
scheduling policy. For general state spaces, however, these
algorithms are either not applicable or not easy to implement. For the
linear cost function, the value iteration algorithm has been 
implemented as an iterative application of linear programming, 
in particular using incremental pruning [5]. On the other hand, 
for the average cost MDP scheduling problem with non-linear 
cost function over general spaces the policy iteration algorithm 
has been considered as iteratively solving a version of Poisson 
equation, and the convergence under some conditions has been 
verified for this algorithm [6]. 

In this paper the basic relation between POMDP and its 
corresponding MDP on belief state is captured by introducing 
H-processes. In the observability problem, for each policy of 
the controlled H-process the estimation entropy is defined as 
the limit of the conditional entropy of the hidden component given
all past values of the observable component [7]. The effectiveness of a
policy can be judged by this entropy measure. We show that 
under ergodicity condition this limiting entropy is the average 
cost of the MDP under that policy when the cost function 
is the entropy function over the belief space. This results in 
the formulation of the observability problem as an average 
cost MDP scheduling problem over belief space for which we 
can apply the results of [6] to implement the policy iteration
algorithm for finding the optimal policy.

In the next section we discuss advanced results on Markov 
processes and their controlled version, including ergodicity 
conditions and Poisson equation. The policy iteration algo- 
rithm is also described in this section. In Section III we 
describe the observability problem using H-processes. The 
estimation entropy as the cost function for the observability 
problem is then analyzed in Section IV. In the last section the 
exact MDP scheduling problem for sensor scheduling adapted 
to the application of policy iteration algorithm is formalized. 

In this paper the domain of a random variable $X$ is denoted
by $\mathbb{X}$ if it is a general space, or by $\mathcal{X}$ if it is a finite set. A
discrete time stochastic process is denoted by $\mathbf{X} = \{X_n : n \in \mathbb{Z}\}$.
For a process $\mathbf{X}$, a sequence $X_0, X_1, \ldots, X_n$ is denoted
by $X_0^n$, whereas $X^n$ refers to $X_{-\infty}^n$. The probability $\Pr\{X = x\}$
is shown by $p(x)$ (similarly for conditional probabilities),
whereas $p(X)$ represents a row vector as the distribution of $X$,
i.e., the $k$-th element of the vector $p(X)$ is $\Pr\{X = k\}$. For a
random variable $X$ defined on a set $\mathcal{X}$, we denote by $\nabla_{\mathcal{X}}$ the
probability simplex in $\mathbb{R}^{|\mathcal{X}|}$. A specific element of a vector
or matrix is referred to by its index in square brackets. The
entropy of a random variable $X$ is denoted by $H(X)$, whereas
$h : \nabla_{\mathcal{X}} \to \mathbb{R}^+$ represents the entropy function over $\nabla_{\mathcal{X}}$, i.e.,
$h(p(X)) = H(X)$ for all possible random variables $X$ on $\mathcal{X}$.

II. Markov Decision Processes and Policy Iteration Algorithm

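A time homogeneous Markov chain $\mathbf{X} = \{X_n : n \in \mathbb{Z}\}$
on the general space $\mathbb{X}$ is defined by a conditional probability
$P(x, B)$, $x \in \mathbb{X}$, $B \in \mathcal{B}(\mathbb{X})$. For a given Markov chain and
for a cost function $c : \mathbb{X} \to \mathbb{R}^+$, the following functions are
defined as the discounted cost and the average cost, respectively:

$$
\begin{aligned}
V(x) &= \sum_{t=0}^{\infty} \alpha^t\, E[c(X_t) \mid X_0 = x], \\
J(x) &= \limsup_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} E[c(X_t) \mid X_0 = x],
\end{aligned}
\tag{1}
$$

where $X_t$ evolves based on $P$ and $0 < \alpha < 1$ is the discount
factor.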

A Markov decision process (MDP) is defined by a set of
conditional probabilities $P_a(x, B)$, $x \in \mathbb{X}$, $B \in \mathcal{B}(\mathbb{X})$, $a \in \mathcal{A}$,
where $\mathcal{A}$ is the control set. A Markov decision process
with a control policy $w : \mathbb{X} \to \mathcal{A}$ is a Markov process
with $P_w(x, B) = P_{w(x)}(x, B)$. For a given cost function¹
$c : \mathbb{X} \to \mathbb{R}^+$, the discounted cost $V(w, x)$ and the average
cost $J(w, x)$ are also functions of the policy $w$ and are
defined by Equation (1), where $X_t$ evolves based on $P_w$. An
MDP scheduling problem is to find an optimal policy $w^*$ that
minimizes one of these cost criteria (for all $x \in \mathbb{X}$) [4]:

$$
\begin{aligned}
w^* &= \arg\min_{w} V(w, x) && \text{(discounted cost problem)}, \\
w^* &= \arg\min_{w} J(w, x) && \text{(average cost problem)}.
\end{aligned}
\tag{2}
$$

¹ In more general settings the cost function is $c : \mathbb{X} \times \mathcal{A} \to \mathbb{R}$. The results
are easily extended to this case if $c = c_1 + c_2$, $c_1 : \mathbb{X} \to \mathbb{R}$, $c_2 : \mathcal{A} \to \mathbb{R}$.

Let $B(\mathbb{X})$ be the set of real-valued bounded measurable
functions on $\mathbb{X}$, and $\mathcal{M}(\mathbb{X})$ be the set of probability measures
on $\mathbb{X}$. For a conditional probability $P$, we define two operations
on $B(\mathbb{X})$ and $\mathcal{M}(\mathbb{X})$ as follows [4]:

$$ Pc(x) = \int c(y)\, P(x, dy), \tag{3} $$

$$ \mu P(B) = \int P(x, B)\, \mu(dx), \tag{4} $$

for $c \in B(\mathbb{X})$ and $\mu \in \mathcal{M}(\mathbb{X})$. We see that if $\mu$ is the
distribution of $X_t$, then $\mu P$ is the distribution of $X_{t+1}$. Also,

$$ Pc(x) = E[c(X_{t+1}) \mid X_t = x]. \tag{5} $$

The measure $\mu$ is an invariant measure of $P$ if

$$ \mu P = \mu, $$

i.e., an invariant measure of $P$ is a fixed point of the (left)
operator of $P$ in (4).

For a given weight function $u : \mathbb{X} \to [1, \infty)$ the $u$-norm of
$P$ is defined as

$$ \|P\|_u = \sup_{x} u(x)^{-1} \int u(y)\, |P(x, dy)|, $$

where $|P(x, \cdot)|$ is the total variation of the measure $P(x, \cdot)$.
If $\|P\|_u < 1$ for some $u$, then the mappings defined in (3) and
(4) are contractions.

Banach's Fixed Point Theorem states that a contraction
map on a complete metric space has a unique fixed point.

The Mean Ergodic Theorem states that, if $\mu$ is the unique
invariant measure of $P$, then

$$ c^* = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P^t c = \int c\, d\mu, \tag{6} $$

and $c^*$ is a constant ($\mu$ almost-everywhere).
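For a finite state space the statement in (6) can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-state chain `P` and cost vector `c` (neither appears in the paper); the helper names `invariant_measure` and `cesaro_average` are also illustrative.

```python
import numpy as np

# Hypothetical 3-state chain and cost; the numbers are illustrative only.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
c = np.array([1.0, 0.0, 2.0])

def invariant_measure(P):
    """Solve mu P = mu together with sum(mu) = 1 as a least-squares system."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def cesaro_average(P, c, N):
    """Return (1/N) * sum_{t=0}^{N-1} P^t c, one entry per initial state."""
    total = np.zeros_like(c)
    term = c.copy()              # P^0 c
    for _ in range(N):
        total += term
        term = P @ term          # next power applied to c
    return total / N

mu = invariant_measure(P)
c_star = float(mu @ c)           # the integral of c with respect to mu
avg = cesaro_average(P, c, 20000)
```

In this sketch every entry of `avg` should be close to `c_star`, which illustrates that the limit in (6) does not depend on the initial state when the invariant measure is unique.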

For a given conditional probability $P$ on $\mathbb{X}$, and a fixed
function $c : \mathbb{X} \to \mathbb{R}^+$, the system of equations

$$
\begin{aligned}
g + f &= c + Pf, \\
g &= Pg,
\end{aligned}
\tag{7}
$$

is called Poisson's equation. For a solution $(g, f)$, the
function $g$ is called the invariant (or harmonic) function, and

$$ g = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P^t g = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P^t c, $$

where the second equality is true only if $P^N f / N \to 0$. If $\mu$ is
the unique invariant measure of $P$, then $g = c^*$ in (6).
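On a finite state space, Poisson's equation reduces to a linear system. The sketch below assumes a hypothetical irreducible 3-state chain and pins $f[0] = 0$ to remove the additive constant in $f$; the function name `solve_poisson` and the normalization are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3-state chain and cost; the numbers are illustrative only.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
c = np.array([1.0, 0.0, 2.0])

def solve_poisson(P, c):
    """Solve g + f = c + P f for a constant g and a vector f with f[0] = 0.

    Assumes the chain has a single recurrent class, so g is a constant.
    """
    n = P.shape[0]
    A = np.eye(n) - P
    # f[0] is pinned to 0, so its column is reused for the coefficient of g.
    A[:, 0] = 1.0
    y = np.linalg.solve(A, c)
    g = float(y[0])
    f = y.copy()
    f[0] = 0.0
    return g, f

g, f = solve_poisson(P, c)
residual = g + f - (c + P @ f)   # should vanish up to rounding
```

Because `g` is constant here, the second equation of (7), $g = Pg$, holds automatically, and `g` should agree with the ergodic average $c^*$ of (6).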

For the Markov decision process we consider this version
of Poisson's equation (with $\alpha$ a constant):

$$ \alpha + f(x) = \min_{a \in \mathcal{A}} [c(x) + P_a f(x)]. \tag{8} $$



The following theorem gives a condition for the optimal policy
of a Markov decision process [6], [4].

Theorem 1: For the average cost MDP scheduling problem
with cost function $c$, if $f$ solves Equation (8) for some constant
$\alpha$, then the policy

$$ w^*(x) = \arg\min_{a \in \mathcal{A}} [c(x) + P_a f(x)] \tag{9} $$

is optimal, provided that

$$ \frac{1}{n} P_w^n f(x) \to 0, \quad n \to \infty. \tag{10} $$
An algorithm for finding the optimal policy is the Policy
Iteration Algorithm. This algorithm iteratively generates a
sequence of policies $w_n$ starting from an initial policy $w_0$.
Given $w_n$ in its $n$-th iteration, the algorithm finds $w_{n+1}$ by
the following two steps.

• For the Markov chain with conditional probability $P_{w_n}$,
solve Poisson's equation (7) for $f_n$ (up to a constant).
• Find $w_{n+1}$ by $w_{n+1}(x) = \arg\min_{a \in \mathcal{A}} [c(x) + P_a f_n(x)]$.

It is shown that for a finite state space $\mathbb{X}$, the sequence of
policies $w_n$ converges to the optimal policy satisfying the
optimality condition (9). For general state spaces, sufficient
conditions for convergence have been given in [6].
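The two steps above can be sketched for a finite state space as follows. This is an illustration under stated assumptions, not the general-state-space algorithm of [6]: the two-action, three-state model is hypothetical, every policy is assumed to induce a chain with a single recurrent class, and ties in the improvement step keep the current action to avoid cycling.

```python
import numpy as np

def evaluate_policy(P_w, c):
    """Solve g + f = c + P_w f with f[0] = 0 (assumes a single recurrent class)."""
    n = P_w.shape[0]
    A = np.eye(n) - P_w
    A[:, 0] = 1.0                      # column of f[0] reused for the constant g
    y = np.linalg.solve(A, c)
    f = y.copy()
    f[0] = 0.0
    return float(y[0]), f

def policy_iteration(P, c, max_iter=100):
    """P has shape (num_actions, n, n) and c has shape (n,)."""
    num_actions, n, _ = P.shape
    states = np.arange(n)
    w = np.zeros(n, dtype=int)         # initial policy w_0: action 0 everywhere
    for _ in range(max_iter):
        g, f = evaluate_policy(P[w, states, :], c)
        q = c[None, :] + P @ f         # q[a, x] = c(x) + (P_a f)(x)
        w_new = q.argmin(axis=0)
        # Keep the current action where it already attains the minimum.
        keep = np.isclose(q[w, states], q.min(axis=0))
        w_new = np.where(keep, w, w_new)
        if np.array_equal(w_new, w):
            break
        w = w_new
    return w, g, f

# Hypothetical model with strictly positive transition matrices.
P = np.array([
    [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80]],
    [[0.20, 0.40, 0.40], [0.40, 0.20, 0.40], [0.40, 0.40, 0.20]],
])
c = np.array([0.0, 1.0, 2.0])

w_star, g_star, f_star = policy_iteration(P, c)
```

On this small example the returned average cost `g_star` can be compared against an exhaustive evaluation of all eight stationary policies.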

III. The Optimal Observability Problem 

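Consider a pair of correlated processes (S, Z) with finite
domain sets $\mathcal{S}$ and $\mathcal{Z}$, respectively. We define two random
vectors $\pi_n$ and $\rho_n$ as functions of $Z^{n-1}$ on the domains
$\nabla_{\mathcal{S}}$ and $\nabla_{\mathcal{Z}}$, respectively:

$$ \pi_n(Z^{n-1}) = p(S_n \mid Z^{n-1}), \tag{11} $$

$$ \rho_n(Z^{n-1}) = p(Z_n \mid Z^{n-1}). \tag{12} $$

According to our notation, the random vector $\pi_n$ (similarly
$\rho_n$) has elements $\pi_n[k] = p(S_n = k \mid Z^{n-1})$, $k = 1, 2, \ldots, |\mathcal{S}|$.

For a hidden Markov process it is shown [7] that there are
mappings $\zeta : \nabla_{\mathcal{S}} \to \nabla_{\mathcal{Z}}$ and $\eta : \mathcal{Z} \times \nabla_{\mathcal{S}} \to \nabla_{\mathcal{S}}$ such that for
any $n$,

$$ \rho_n = \zeta(\pi_n), \qquad \pi_{n+1} = \eta(Z_n, \pi_n). \tag{13} $$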



The results on the analysis and representation of such a process
can be extended to any pair of joint processes for which the
mappings $\zeta$ and $\eta$ exist, irrespective of how the maps are defined. The
existence of such mappings in fact implies that such a joint
process can be described by an iterated function system [8], [7].
We therefore take the existence of the mappings in
(13) as the defining property of a special class of joint processes
that we call H-processes. In this paper, this definition helps
to bridge partially observed Markov decision problems to
simpler Markov decision problems.

Definition 1: A pair of correlated processes (S, Z) is called
an H-process if the sequences $\pi_n$ and $\rho_n$ are related by some
mappings $\zeta$ and $\eta$ as in (13).



We refer to S as the hidden component and Z as the observable 
component of the H-process. An example of H-process is the 
hidden Markov process. 

A key property of an H-process is that $\pi_n$ is a Markov chain
on $\nabla_{\mathcal{S}}$ with the transition probability $P(x, B)$ defined by

$$ P(x, B) = \sum_{l=1}^{|\mathcal{Z}|} 1_B(\eta(l, x))\, \zeta(x)[l] \tag{14} $$

(where $1_B(\cdot)$ is the indicator function of the set $B$), or equivalently by

$$
P(x, x') =
\begin{cases}
\zeta(x)[l], & x' = \eta(l, x),\ l = 1, 2, \ldots, |\mathcal{Z}|, \\
0, & \text{otherwise.}
\end{cases}
\tag{15}
$$

In fact the left operation of $P$ in (4) represents the evolution of
the distribution of $\pi_n$ (denoted by $\mu_n$) as probability measures on
$\nabla_{\mathcal{S}}$, i.e., $\mu_{n+1} = \mu_n P$, which is easy to verify by the iterated
function system representation of such processes [7, Eq. 1].

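The controlled version of an H-process can be described by
considering an action process $A_n \in \mathcal{A}$ ($\mathcal{A}$ is the control set)
and the existence of mappings $\zeta_a$ and $\eta_a$, $\forall a \in \mathcal{A}$, such that

$$ \rho_n = \zeta_{A_n}(\pi_n), \qquad \pi_{n+1} = \eta_{A_n}(Z_n, \pi_n). \tag{16} $$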



A controlled H-process defines a Markov decision process on
$\nabla_{\mathcal{S}}$ with action set $\mathcal{A}$, and conditional probability for each
action $a$ as

$$ P_a(x, B) = \sum_{l=1}^{|\mathcal{Z}|} 1_B(\eta_a(l, x))\, \zeta_a(x)[l]. \tag{17} $$



For a controlled H-process a stationary control policy is a
function $w : \nabla_{\mathcal{S}} \to \mathcal{A}$ which deterministically connects $A_n$
to $\pi_n$, $A_n = w(\pi_n)$. The pair (S, Z) when controlled by a
policy $w$ is an H-process with mappings

$$
\begin{aligned}
\zeta_w(x) &= \zeta_{w(x)}(x), \\
\eta_w(z, x) &= \eta_{w(x)}(z, x).
\end{aligned}
\tag{18}
$$

Hence, a controlled H-process with policy $w$ defines a Markov
chain on $\nabla_{\mathcal{S}}$ with conditional probability

$$ P_w(x, B) = P_{w(x)}(x, B). $$

Now using H-processes, we show that the optimal observability
problem of POMDPs can be considered as finding an
optimal policy for a Markov decision process.

A POMDP is a controlled version of a hidden Markov
process (HMP). Consider an HMP [7] with state transition
probability matrix $Q$, representing the dynamics of the process,
and measurement (emission) matrix $T$, representing the sensor.
A POMDP is an HMP in which these matrices are functions
of a control action $a \in \mathcal{A}$, and we choose the action based
on our past and current observations for either controllability
or observability purposes. In the controllability problem the
aim is to control the state process to move it towards more
favorable states, but in the observability problem the state
process is autonomous and the aim is to dynamically adjust
the measuring apparatus to have the best observation of the
state process. In both problems the control action is based
on $\pi_n$, which represents the belief about the state and is a
sufficient statistic for past observations. A control policy is a
rule for choosing actions based on this belief so as to
minimize the average (or discounted) expectation of a
cost function over the belief space. For the control problem the
cost function is $c(\pi) = \pi\beta'$, where $\beta \in \mathbb{R}^{|\mathcal{S}|}$ is a fixed vector
(the states' costs). Hence for the control problem the cost function
is linear. This is not the case for the observability problem.

To formulate the observability problem as a Markov decision
problem, first we consider that an HMP is an H-process with
the following mappings [7]:

$$ \zeta(\pi) = \pi T, \qquad \eta(z, \pi) = \frac{\pi D(z) Q}{\pi D(z) \mathbf{1}'}, \tag{19} $$

where $D(z)$ is a diagonal matrix with $d_{k,k}(z) = T[k, z]$, $k =
1, 2, \ldots, |\mathcal{S}|$. We also note that in the controllability problem it
is (usually only) $Q$ that is a function of the action $a$, whereas
in the observability problem $T$ is a function of $a$. As a result, in
the controllability problem $\zeta$ is usually fixed but we have a set
of functions $\eta_a$, similar to (19) with $Q(a)$ instead of $Q$. These
define a controlled H-process which corresponds to an MDP.
The linearity of the cost function helps to solve this problem by
the value iteration algorithm using incremental pruning [5].

In contrast, for the observability problem the POMDP is a
controlled H-process with the following mappings:

$$ \zeta_a(\pi) = \pi T(a), \qquad \eta_a(z, \pi) = \frac{\pi D(z, a) Q}{\pi D(z, a) \mathbf{1}'}, \tag{20} $$

where $D(z, a)$ is a diagonal matrix with $d_{k,k}(z, a) = T(a)[k, z]$,
$k = 1, 2, \ldots, |\mathcal{S}|$. Moreover, the cost function on $\nabla_{\mathcal{S}}$ cannot
be a linear function. The (positive) cost function needs to be
designed so that it imposes a higher cost when the
belief $\pi_n$ moves away from the vertices of $\nabla_{\mathcal{S}}$. At the vertices,
the belief about the state is complete certainty, and thus the cost is zero.
A policy is optimal if it ensures that, as the state process
evolves autonomously, the belief $\pi_n$ in expectation hops
only between the regions close to the vertices, so that the average
ambiguity about the state (the cost) is minimized.

In previous works [9], [2], the cost function $c(\pi) = 1 - \pi\pi'$
has been considered for this problem due to its ease of
approximation by piecewise linear functions (and its zero cost at the vertices),
hence allowing application of dynamic programming under the
discounted criterion. In this paper we consider the entropy
function as the cost function, $c(\pi) = h(\pi)$, and the average cost
criterion, hence allowing application of PIA. An interesting
result is that under the ergodicity condition the average cost
$J(w, x)$ is the same as the estimation entropy defined in [7] for
the H-process (S, Z) corresponding to $w$. We first discuss this
relation in the next section and then formalize the observability
problem of sensor scheduling as an MDP scheduling problem.

IV. The Estimation Entropy

The entropy rate of a process Z is defined as

$$ H_Z = \lim_{n \to \infty} \frac{1}{n} H(Z_0^n), \tag{21} $$

when the limit exists. From the chain rule, we can write

$$ H(Z_i \mid Z_0^{i-1}) = H(Z_0^i) - H(Z_0^{i-1}). $$

We see that the entropy rate is the limit of the Cesaro mean of
the above $i$-sequence, i.e.,

$$ H_Z = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Z_i \mid Z_0^{i-1}). \tag{22} $$



For an H-process we define the entropy rate as the entropy
rate of its observable component. Along the lines of (22), the
estimation entropy [7] for an H-process is defined as

$$ H_{S/Z} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(S_i \mid Z_0^{i-1}). \tag{23} $$

We know that if a sequence $a_n$ converges, then the sequence
of its Cesaro means (i.e., $\frac{1}{n} \sum_{i=1}^{n} a_i$) also converges to the
same limit [10, Theorem 4.2.3]. As a result, for stationary
H-processes, the entropy rate and estimation entropy can be
written as

$$
\begin{aligned}
H_Z &= \lim_{n \to \infty} H(Z_n \mid Z_0^{n-1}), \\
H_{S/Z} &= \lim_{n \to \infty} H(S_n \mid Z_0^{n-1}).
\end{aligned}
\tag{24}
$$



From its definition we see that the estimation entropy for
an H-process is the limit of the running average of the residual
uncertainty about the hidden component given
all past observations, thus it inversely measures the
observability of the hidden process. We also show
that under ergodicity conditions the estimation entropy is the
long run average entropy of the belief process. To this end,
using the relations in Section II, we can prove the following
results [11].

Lemma 1: For an H-process,

$$ H(S_n \mid Z_0^{n-1}, \pi_0 = x) = P^n h(x), \tag{25} $$

where $P$ is defined by (14) and $h$ is the entropy function.

Theorem 2: For an H-process, if $P$ has a unique invariant
measure $\mu$, then

$$ H_{S/Z} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P^t h = \int h\, d\mu. \tag{26} $$



For a hidden Markov process with a primitive matrix $Q$
and a matrix $T$ with nonzero elements, it is shown that the
integral expression in (26) for $H_{S/Z}$ holds for any attractive
and invariant measure $\mu$ [7, Theorem 1].
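Lemma 1 can be checked numerically for a small hidden Markov process. The sketch below assumes a hypothetical two-state model with matrices `Q` and `T` (illustrative values, not from the paper) and uses the HMP mappings $\zeta(\pi) = \pi T$ and $\eta(z, \pi) = \pi D(z) Q / (\pi D(z) \mathbf{1}')$. It compares $P^n h(x)$, computed by recursion on the kernel, with the conditional entropy computed directly from the joint law of $(S_n, Z_0, \ldots, Z_{n-1})$.

```python
import itertools
import numpy as np

# Hypothetical two-state HMP; the numbers are illustrative only.
Q = np.array([[0.9, 0.1], [0.3, 0.7]])       # state transition matrix
T = np.array([[0.8, 0.2], [0.25, 0.75]])     # emission matrix, T[k, z]

def h(p):
    """Entropy function on the simplex (natural logarithm)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def zeta(x):
    """Predictive distribution of the next observation."""
    return x @ T

def eta(z, x):
    """Belief update after observing symbol z."""
    weighted = x * T[:, z]                   # x D(z)
    return (weighted @ Q) / weighted.sum()

def Pn_h(x, n):
    """(P^n h)(x), by recursion on the kernel of the belief chain."""
    if n == 0:
        return h(x)
    probs = zeta(x)
    return sum(probs[z] * Pn_h(eta(z, x), n - 1)
               for z in range(T.shape[1]) if probs[z] > 0)

def conditional_entropy(x, n):
    """H(S_n | Z_0, ..., Z_{n-1}) with S_0 distributed as x, from the joint law."""
    total = 0.0
    for zs in itertools.product(range(T.shape[1]), repeat=n):
        joint = x.copy()                     # unnormalized p(s, z_0, ..., z_k)
        for z in zs:
            joint = (joint * T[:, z]) @ Q
        pz = joint.sum()                     # probability of the sequence zs
        if pz > 0:
            total += pz * h(joint / pz)
    return total

x0 = np.array([0.6, 0.4])
pairs = [(Pn_h(x0, n), conditional_entropy(x0, n)) for n in range(6)]
```

Each pair in `pairs` should agree up to rounding, which is the content of (25) for this model. The enumeration grows exponentially in `n`, so the check is only practical for short horizons.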

According to (1) and Theorem 2 we see that under the ergodicity
condition (existence of a unique invariant measure) the
estimation entropy of an H-process is the average cost of a
Markov chain with conditional probability $P$ in (14) and the
cost function equal to the entropy function $h$:

$$ H_{S/Z} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} E[h(X_t) \mid X_0 = x] = J(x). \tag{27} $$



For a controlled H-process the estimation entropy as a function
of policy $w$ is this average cost for the Markov chain
corresponding to $P_w$, i.e., $H_{S/Z}(w) = J(w, x)$, $\forall x$, provided
that $P_w$ has a unique invariant measure.

Using Banach's Fixed Point Theorem, we see that if
$\|P_w\|_u < 1$ for some $u : \nabla_{\mathcal{S}} \to [1, \infty)$, then $P_w$ has a unique
invariant measure. From this we infer that a sufficient condition
for the existence of a unique invariant measure of $P_w$ is that
for some $u : \nabla_{\mathcal{S}} \to [1, \infty)$,

$$ \sum_{z=1}^{|\mathcal{Z}|} u(\eta_w(z, x))\, \zeta_w(x)[z] < u(x), \quad \forall x \in \nabla_{\mathcal{S}}. \tag{28} $$
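The link between the $u$-norm and condition (28) can be spelled out in one line. This is a sketch that assumes $P_w$ has the finite-sum form (17) with the policy mappings (18), so that $P_w(x, \cdot)$ is a positive discrete measure and the total variation integral reduces to a sum over measurement symbols:

```latex
\[
u(x)^{-1} \int u(y)\, |P_w(x, dy)|
  = u(x)^{-1} \sum_{z=1}^{|\mathcal{Z}|} u\big(\eta_w(z, x)\big)\, \zeta_w(x)[z].
\]
```

Requiring this ratio to be below one at each $x$ gives the inequality in (28); the norm condition $\|P_w\|_u < 1$ asks in addition that the supremum of the ratio over $x$ stay below one.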



Although according to (27), under the ergodicity condition (for
any $w$), the POMDP observability problem for minimum
estimation entropy $H_{S/Z}(w)$ is equivalent to the average cost
MDP scheduling problem for $J(w, x)$, we consider the sensor
scheduling problem only as the average cost MDP problem.
This MDP is defined by the set of conditional probabilities
$P_a$ in (17), where $\zeta_a$ and $\eta_a$ are defined by (20). In a
similar problem, the minimum estimation entropy criterion has
been considered for finding the optimal policy of multiple
measurement hidden Markov processes [12]. That analysis
required the ergodicity of $P_w$ for any $w$. Under such a
condition, an optimality criterion simpler than
Theorem 1 has been conjectured for that problem.

V. Sensor Scheduling for Optimal Observability 

Using the POMDP observability problem as the framework 
for sensor scheduling, here we formalize this scheduling 
as an average cost MDP scheduling problem. Let $M$, $L$, $A$
be the cardinalities of the state, measurement, and control sets,
respectively. The integer $A$ represents the number of sensors.
We also have an $M \times M$ state transition probability matrix $Q$,
and a set of $M \times L$ measurement (sensor) probability matrices
$T_a$, $a = 1, 2, \ldots, A$. The description of the MDP scheduling framework
based on [6] for the sensor scheduling problem is as follows:

• The Markov decision process evolves on the state space
$\mathbb{X}$, where $\mathbb{X}$ is the probability simplex in $\mathbb{R}^M$.

• The action set is $\mathcal{A} = \{1, 2, \ldots, A\}$, and the admissible
action set for any $x \in \mathbb{X}$ is $\mathcal{A}$.

• The set of conditional probability distributions
$P_a(x, B)$, $a \in \mathcal{A}$, is defined by

$$ P_a(x, B) = \sum_{l=1}^{|\mathcal{Z}|} 1_B(\eta_a(l, x))\, (x T_a)[l], \tag{29} $$

where

$$ \eta_a(l, x) = \frac{x D_a(l) Q}{x D_a(l) \mathbf{1}'}, \tag{30} $$

and $D_a(l)$ is a diagonal matrix with $m$-th diagonal
element being $T_a[m, l]$, $m = 1, 2, \ldots, M$.

• For any stationary policy $w : \mathbb{X} \to \mathcal{A}$ the state process
$\mathbf{X}(w) = \{X_t(w) : t \in \mathbb{Z}\}$ is a Markov chain with
conditional probability $P_w(x, B) = P_{w(x)}(x, B)$.

• The average cost of a policy $w$ for a given initial condition
$x$ is

$$ J(w, x) = \limsup_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} E[c(X_t)], \tag{31} $$

where the cost function $c : \mathbb{X} \to [0, \log(M)]$ is the
entropy function defined by

$$ c(x) = h(x) = -\sum_{m} x[m] \log x[m]. $$

Objective: Find the optimal policy $w^*$ where $J(w^*, x) \le
J(w, x)$ for all policies $w$ and any initial state $x$.
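The ingredients of this framework can be sketched in code for a small model. Everything concrete below is an assumption made for illustration: the three-state, two-sensor, two-symbol matrices `Q` and `T`, the function names, and in particular `myopic_policy`, which is a one-step lookahead heuristic minimizing $P_a h(x)$ and is not the optimal policy $w^*$ produced by PIA. The average cost (31) is estimated by simulating the belief chain.

```python
import numpy as np

# Hypothetical model: M = 3 states, A = 2 sensors, L = 2 measurement symbols.
Q = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
T = np.array([
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],    # sensor 0: separates states 0 and 1
    [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]],    # sensor 1: separates states 1 and 2
])

def h(x):
    """Entropy cost c(x) = h(x), natural logarithm."""
    p = x[x > 0]
    return float(-(p * np.log(p)).sum())

def successors(x, a):
    """List of (probability, next belief) pairs for sensor a, as in (29)-(30)."""
    out = []
    for l in range(T.shape[2]):
        weighted = x * T[a][:, l]            # x D_a(l)
        pl = weighted.sum()                  # (x T_a)[l]
        if pl > 0:
            out.append((pl, (weighted @ Q) / pl))
    return out

def one_step_cost(x, a):
    """(P_a h)(x): expected entropy of the next belief under sensor a."""
    return sum(p * h(y) for p, y in successors(x, a))

def myopic_policy(x):
    """One-step lookahead heuristic (an assumption, not the PIA optimum)."""
    return int(np.argmin([one_step_cost(x, a) for a in range(T.shape[0])]))

def average_entropy(policy, x0, steps, rng):
    """Monte Carlo estimate of the average cost (31) along one belief trajectory."""
    x, total = x0, 0.0
    for _ in range(steps):
        total += h(x)
        succ = successors(x, policy(x))
        probs = np.array([p for p, _ in succ])
        x = succ[rng.choice(len(succ), p=probs / probs.sum())][1]
    return total / steps

x0 = np.full(3, 1.0 / 3.0)
J_myopic = average_entropy(myopic_policy, x0, 2000, np.random.default_rng(0))
J_fixed = average_entropy(lambda x: 0, x0, 2000, np.random.default_rng(0))
```

Comparing `J_myopic` with `J_fixed` gives a rough sense of how much a belief-dependent sensor choice can reduce the average entropy in this toy model; a single simulated trajectory is only an estimate of (31).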

As an average cost MDP scheduling problem, the objective
can be achieved by the policy iteration algorithm [6]. The solution
$w^*$ will then be the optimum observation policy. Future
work includes an adaptation of PIA and a rigorous analysis
of convergence for the algorithm under the conditions of this
problem, using the results of [6].